We consider a new construction of locality-sensitive hash functions for Hamming space that is covering
in the sense that it is guaranteed to produce a collision for every pair of vectors within a given radius r.
The construction is efficient in the sense that the expected number of hash collisions between vectors at
distance cr, for a given c > 1, comes close to that of the best possible data independent LSH without the
covering guarantee, namely, the seminal LSH construction of Indyk and Motwani (STOC ’98). The efficiency
of the new construction essentially matches their bound when the search radius is not too large, e.g., when
$cr = o(\log(n)/\log\log n)$, where n is the number of points in the data set, and when $cr = \log(n)/k$ where k is
an integer constant. In general, it differs by at most a factor ln(4) in the exponent of the time bounds. As a
consequence, LSH-based similarity search in Hamming space can avoid the problem of false negatives at
little or no cost in efficiency.

1. INTRODUCTION

Similarity search in high dimensions has been a subject of intense research for the last 
decades in several research communities including theory of computation, databases, 
machine learning, and information retrieval. In this paper we consider nearest neighbor 
search in Hamming space, where the task is to find a vector in a preprocessed set
$S \subseteq \{0,1\}^d$ that has minimum Hamming distance to a query vector $y \in \{0,1\}^d$.

It is known that efficient data structures for this problem, i.e., whose query and
preprocessing time does not increase exponentially with d, would disprove the strong
exponential time hypothesis [Williams 2005; Alman and Williams 2015]. For this reason
the algorithms community has studied the problem of finding a c-approximate nearest 
neighbor, i.e., a point whose distance to y is bounded by c times the distance to a nearest 
neighbor, where c > 1 is a user-specified parameter. If the exact nearest neighbor is 
sought, the approximation factor c can be seen as a bound on the relative distance 
between the nearest and the second nearest neighbor. All existing c-approximate nearest 
neighbor data structures that have been rigorously analyzed have one or more of the 
following drawbacks: 


(1) Worst case query time linear in the number of points in the data set, or 

(2) Worst case query time that grows exponentially with d, or 

(3) Multiplicative space overhead that grows exponentially with d, or

(4) Lack of unconditional guarantee to return a nearest neighbor (or c-approximate
nearest neighbor).

Arguably, the data structures that come closest to overcoming these drawbacks are
based on locality-sensitive hashing (LSH). For many metrics, including the Hamming
metric discussed in this paper, LSH yields sublinear query time (even for $d \gg \log n$) and
space usage that is polynomial in n and linear in the number of dimensions [Indyk and
Motwani 1998; Gionis et al. 1999]. If the approximation factor c is larger than a certain
constant (currently known to be at most 3) the space can even be made $O(nd)$, still
with sublinear query time [Panigrahy 2006; Kapralov 2015; Laarhoven 2015].

However, these methods come with a Monte Carlo-type guarantee: A c-approximate 
nearest neighbor is returned only with high probability, and there is no efficient way of 
detecting if the computed result is incorrect. This means that they do not overcome the 
4th drawback above. 


Contribution. In this paper we investigate the possibility of Las Vegas-type guarantees
for (c-approximate) nearest neighbor search in Hamming space. Traditional LSH
schemes pick the sequence of hash functions independently, which inherently implies
that we can only hope for high probability bounds. Extending and improving results
by [Greene et al. 1994] and [Arasu et al. 2006] we show that in Hamming space, by
suitably correlating hash functions we can “cover” all possible positions of r differences
and thus eliminate false negatives, while achieving performance bounds comparable
to those of traditional LSH methods. By known reductions [Indyk 2007] this implies
Las Vegas-type guarantees also for $\ell_1$ and $\ell_2$ metrics. Since our methods are based on
combinatorial objects called coverings we refer to the approach as CoveringLSH.


Let $\|x - y\|$ denote the Hamming distance between vectors x and y. Our results imply
the following theorem on similarity search (specifically c-approximate near neighbor
search) in a standard unit cost (word RAM) model:

THEOREM 1.1. Given $S \subseteq \{0,1\}^d$, $c > 1$ and $r \in \mathbb{N}$, we can construct a data structure
such that for $n = |S|$ and a value $f(n, r, c)$ bounded by

$$f(n,r,c) = \begin{cases} O(1) & \text{if } \log(n)/(cr) \in \mathbb{N}, \\ (\log n)^{O(1)} & \text{if } cr \le \log(n)/(3 \log\log n), \\ O(\min(n^{0.4/c}, 2^r)) & \text{for all parameters,} \end{cases}$$

the following holds:

- On query $y \in \{0,1\}^d$ the data structure is guaranteed to return $x \in S$ with $\|x - y\| < cr$
  if there exists $x' \in S$ with $\|x' - y\| \le r$.

- The expected query time is $O(f(n, r, c)\, n^{1/c} (1 + d/w))$, where w is the word length.

- The size of the data structure is $O(f(n, r, c)\, n^{1+1/c} \log n + nd)$ bits.


Our techniques, like traditional LSH, extend to efficiently solve other variants of 
similarity search. For example, we can: 1) handle nearest neighbor search without 
knowing a bound on the distance to the nearest neighbor, 2) return all near neighbors 
instead of just one, and 3) achieve high probability bounds on query time rather than 
just an expected time bound. 

When $f(n, r, c) = O(1)$ the performance of our data structure matches that of classical
LSH with constant probability of a false negative [Indyk and Motwani 1998; Gionis
et al. 1999], so $f(n, r, c)$ is the multiplicative overhead compared to classical LSH. In
fact, [O’Donnell et al. 2014] showed that the exponent of 1/c in query time is optimal
for methods based on (data independent) LSH.

1.1. Notation

For a set S and function f we let $f(S) = \{f(x) \mid x \in S\}$. We use 0 and 1 to denote vectors
of all 0s and 1s, respectively. For $x, y \in \{0,1\}^d$ we use $x \wedge y$ and $x \vee y$ to denote bit-wise
conjunction and disjunction, respectively, and $x \oplus y$ to denote the bitwise exclusive-or.
Let $I(x) = \{i \mid x_i = 1\}$. We use $\|x\| = |I(x)|$ to denote the Hamming weight of a vector x,
and

$$\|x - y\| = |I(x \oplus y)|$$

to denote the Hamming distance between x and y. For $S \subseteq \{0,1\}^d$ let $\Delta_S$ be an upper
bound on the time required to produce a representation of the nonzero entries of a vector
in S in a standard (word RAM) model [Hagerup 1998]. Observe that in general $\Delta_S$
depends on the representation of vectors (e.g., bit vectors for dense vectors, or sparse
representations if d is much larger than the largest Hamming weight). For bit vectors
we have $\Delta_S = O(1 + d/w)$ if we assume the ability to count the number of 1s in a word
in constant time,¹ and this is where the term $1 + d/w$ in Theorem 1.1 comes from. We
use “x mod b” to refer to the integer in $\{0, \dots, b - 1\}$ whose difference from x is divisible
by b. Finally, let $\langle x, y \rangle$ denote $\|x \wedge y\|$, i.e., the dot product of x and y.

2. BACKGROUND AND RELATED WORK 

Given $S \subseteq \{0,1\}^d$ the problem of searching for a vector in S within Hamming distance r
from a given query vector y was introduced by Minsky and Papert as the approximate
dictionary problem [Minsky and Papert 1987]. The generalization to arbitrary spaces is
now known as the near neighbor problem (or sometimes as point location in balls). It
is known that a solution to the approximate near neighbor problem for fixed r (known
before query time) implies a solution to the nearest neighbor problem with comparable
performance [Indyk and Motwani 1998; Har-Peled et al. 2012]. In our case this is
somewhat simpler to see, so we give the argument for completeness. Two reductions
are of interest, depending on the size of d. If d is small we can obtain a nearest neighbor
data structure by having a data structure for every radius r, at a cost of factor d in space
and log d in query time. Alternatively, if d is large we can restrict the set of radii to the
$O(\log(n) \log(d))$ radii of the form $\lceil (1 + 1/\log n)^i \rceil < d$. This decreases the approximation
factor needed for the near neighbor data structures by a factor $1 + 1/\log n$, which can be
done with no asymptotic cost in the data structures we consider. For this reason, in the
following we focus on the near neighbor problem in Hamming space where r is assumed
to be known when the data structure is created.
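The second reduction can be illustrated with a small sketch that enumerates the geometric grid of radii. The function name `geometric_radii` is hypothetical, and the base-2 logarithm and the starting index i = 0 are assumptions made for the illustration; the text above fixes neither.

```python
import math


def geometric_radii(n: int, d: int) -> list[int]:
    """Distinct radii ceil((1 + 1/log2(n))**i) that are smaller than d, for i = 0, 1, 2, ...

    Assumption: logarithms are base 2 and the grid starts at i = 0.
    """
    step = 1.0 + 1.0 / math.log2(n)
    radii = set()
    value = 1.0  # step**0
    while math.ceil(value) < d:
        radii.add(math.ceil(value))
        value *= step
    return sorted(radii)


if __name__ == "__main__":
    grid = geometric_radii(n=1024, d=1000)
    print(len(grid), grid[:10])
```

Every target radius t is then served by the smallest grid radius that is at least t, which exceeds t by at most the factor $1 + 1/\log n$ (up to rounding).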

2.1. Deterministic algorithms 

For simplicity we will restrict attention to the case $r < d/2$. A baseline is the brute force
algorithm that looks up all $\binom{d}{r}$ bit vectors of Hamming distance at most r from y. The
time usage is at least $(d/r)^r$, assuming $r < d/2$, so this method is not attractive unless $\binom{d}{r}$
is quite small. The dependence on d was reduced by [Cole et al. 2004] who achieve
query time $O(d + \log^r n)$ and space $O(nd + n \log^r n)$. Again, because of the exponential
dependence on r this method is interesting only for small values of r.

¹ This is true on modern computers using the POPCNT instruction, and implementable with table lookups
if $w = O(\log n)$. If only a minimal instruction set is available it is possible to get $\Delta_S = O(d/w + \log w)$ by a
folklore recursive construction, see e.g. [Hagerup et al. 2001, Lemma 3.2].

2.2. Randomized filtering with false negatives

In a seminal paper [Indyk and Motwani 1998], Indyk and Motwani presented a randomized
solution to the c-approximate near neighbor problem where the search stops as
soon as a vector within distance cr from y is found. Their technique can also be used to
solve the approximate dictionary problem, but the time will then depend on the number
of points at distance between r + 1 and cr that we inspect. Their data structure, like all
LSH methods for Hamming space we consider in this paper, uses a set of functions from
a Hamming projection family:

$$\mathcal{H}_A = \{x \mapsto x \wedge a \mid a \in A\} \tag{1}$$

where $A \subseteq \{0,1\}^d$. The vectors in A will be referred to as bit masks. Given a query y,
the idea is to iterate through all functions $h \in \mathcal{H}_A$ and identify collisions $h(x) = h(y)$ for
$x \in S$, e.g. using a hash table. This procedure covers a query y if at least one collision
is produced when there exists $x \in S$ with $\|x - y\| \le r$, and it is efficient if the number of
hash function evaluations and collisions with $\|x - y\| > cr$ is not too large. The procedure
can be thought of as a randomized filter that attempts to catch data items of interest
while filtering away data items that are not even close to being interesting. The filtering
efficiency with respect to vectors x and y is the expected number of collisions $h(x) = h(y)$
summed over all functions $h \in \mathcal{H}_A$, with expectation taken over any randomness in the
choice of A. We can argue that without loss of generality it can be assumed that the
filtering efficiency depends only on $\|x - y\|$, and not on the location of the differences.
To see this, using an idea from [Arasu et al. 2006], consider replacing each $a \in A$ by
a vector $\pi(a)$ defined by $\pi(a)_i = a_{\pi(i)}$, where $\pi : \{1, \dots, d\} \to \{1, \dots, d\}$ is a random
permutation used for all vectors in A. This does not affect distances, and means that
collision probabilities will depend solely on $\|x - y\|$, d, and the Hamming weights of
vectors in A.
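The filtering procedure for a Hamming projection family (1) can be sketched as follows. This is a minimal illustration, not the data structure analyzed in this paper: vectors are packed into Python integers, the names `build_index` and `query` are hypothetical, and keying each bucket by the mask index together with the masked value is an assumption made to keep the buckets of different masks apart.

```python
from collections import defaultdict


def build_index(points, masks):
    """Store every point x under the key (j, x AND masks[j]) for each bit mask."""
    index = defaultdict(list)
    for x in points:
        for j, a in enumerate(masks):
            index[(j, x & a)].append(x)
    return index


def query(index, masks, y, max_dist):
    """Return a colliding point within Hamming distance max_dist of y, or None.

    Only points that collide with y under some mask are inspected; the exact
    distance check removes candidates that collide but are too far away.
    """
    for j, a in enumerate(masks):
        for x in index.get((j, y & a), ()):
            if bin(x ^ y).count("1") <= max_dist:
                return x
    return None


if __name__ == "__main__":
    masks = [0b11110000, 0b00001111]  # two complementary bit masks, d = 8
    points = [0b10110010, 0b01001101]
    index = build_index(points, masks)
    print(query(index, masks, 0b10110001, max_dist=2))
```

Whether a near point is guaranteed to be found depends entirely on the choice of masks.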


Remark. If vectors in A are sparse it is beneficial to work with a sparse representation
of the input and output of functions in $\mathcal{H}_A$, and indeed this is what is done by Indyk and
Motwani who consider functions that concatenate a suitable number of 1-bit samples
from x. However, we find it convenient to work with d-dimensional vectors, with the
understanding that a sparse representation can be used if d is large. △


Classical Hamming LSH. Indyk and Motwani use a collection

$$A(R) = \{a(v) \mid v \in R\},$$

where $R \subseteq \{1, \dots, d\}^k$ is a set of uniformly random and independent k-dimensional
vectors. Each vector v encodes a sequence of k samples from $\{1, \dots, d\}$, and a(v) is the
projection vector that selects the sampled bits. That is, $a(v)_i = 1$ if and only if $v_j = i$
for some $j \in \{1, \dots, k\}$. By choosing k appropriately we can achieve a trade-off that
balances the size of R (i.e., the number of hash functions) with the expected number
of collisions at distance cr. It turns out that $|R| = O(n^{1/c} \log(1/\delta))$ suffices to achieve
collision probability $1 - \delta$ at distance r while keeping the expected total number of
collisions with “far” vectors (at distance cr or more) linear in $|R|$.
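The classical construction can be sketched as below. The helper names are hypothetical, positions are 0-based, and the rule in `choose_k` (the smallest k for which a vector at distance cr collides with probability at most 1/n under a single function) is one common way to balance the trade-off, stated here as an assumption rather than as the choice made by Indyk and Motwani.

```python
import math
import random


def sample_mask(d: int, k: int, rng: random.Random) -> int:
    """Bit mask a(v) selecting k positions sampled uniformly with replacement."""
    mask = 0
    for _ in range(k):
        mask |= 1 << rng.randrange(d)
    return mask


def collision_probability(d: int, k: int, dist: int) -> float:
    """Probability that none of the k samples hits one of `dist` differing positions."""
    return (1 - dist / d) ** k


def choose_k(d: int, far: int, n: int) -> int:
    """Smallest k with collision probability at most 1/n at distance `far` (assumed rule)."""
    return math.ceil(math.log(n) / -math.log(1 - far / d))


if __name__ == "__main__":
    d, n, r, c = 128, 2**30, 10, 3
    k = choose_k(d, c * r, n)
    print(k, collision_probability(d, k, r), collision_probability(d, k, c * r))
    print(bin(sample_mask(d, k, random.Random(0))).count("1"))
```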


Newer developments. In a recent advance of [Andoni and Razenshteyn 2015], extending
preliminary ideas from [Andoni et al. 2014], it was shown how data dependent
LSH can achieve the same guarantee with a smaller family (having $n^{o(1)}$ space usage
and evaluation time). Specifically, it suffices to check collisions of $O(n^{\rho} \log(1/\delta))$ hash
values, where $\rho = \frac{1}{2c - 1} + o(1)$. We will not attempt to generalize the new method to the
data dependent setting, though that is certainly an interesting possible extension.

In a surprising development, it was recently shown [Alman and Williams 2015] that
even with no approximation of distances (c = 1) it is possible to obtain truly sublinear
time per query if: 1) $d = O(\log n)$ and, 2) we are concerned with the answers to a batch
of n queries.

2.3. Filtering methods without false negatives

The literature on filtering methods for Hamming distance that do not introduce false
negatives, but still yield formal guarantees, is relatively small. As in section 2.2 the
previous results can be stated in the form of Hamming projection families (1). We
consider constructions of sets A that ensure collision for every pair of vectors at distance
at most r, while at the same time achieving nontrivial filtering efficiency for larger
distances.

Choosing error probability $\delta < 1/\binom{d}{r}$ in the construction of Indyk and Motwani, we
see that there must exist a set $R^*$ of size $O(\log(1/\delta)\, n^{1/c})$ that works for every choice
of r mismatching coordinates, i.e., ensures collision under some $h \in \mathcal{H}_{A(R^*)}$ for all
pairs of vectors within distance r. In particular we have $|A(R^*)| = O(d\, n^{1/c})$. However,
this existence argument is of little help to design an algorithm, and hence we will be
interested in explicit constructions of LSH families without false negatives.²


Kuzjurin has given such explicit constructions of “covering” vectors [Kuzjurin 2000]
but in general the bounds achieved are far from what is possible existentially [Kuzjurin
1995]. Independently, [Greene et al. 1994] linked the question of similarity search
without false negatives to the Turán problem in extremal graph theory. While optimal
Turán numbers are not known in general, Greene et al. construct a family A (based on
corrector hypergraphs) that will incur few collisions with random vectors, i.e., vectors at
distance about d/2 from the query point.³ [Gordon et al. 1995] presented near-optimal
coverings for certain parameters based on finite geometries; in section 5 we will use
their construction to achieve good data structures for small r.

[Arasu et al. 2006] give a construction that is able to achieve, for example, o(1)
filtering efficiency for approximation factor c > 7.5 with $|A| = O(r^{2.39})$. Observe that
there is no dependence on d in these bounds, which is crucial for high-dimensional
(sparse) data. The technique of [Arasu et al. 2006] allows a range of trade-offs between
$|A|$ and the filtering efficiency, determined by parameters $n_1$ and $n_2$. No theoretical
analysis is made of how close to 1 the filtering efficiency can be made for a given c, but
it seems difficult to significantly improve the constant 7.5 mentioned above.

Independently of the work of [Arasu et al. 2006], “lossless” methods for near neighbor
search have been studied in the contexts of approximate pattern matching [Kucherov
et al. 2005] and computer vision [Norouzi et al. 2012]. The analytical part of these
papers differs from our setting by focusing on filtering efficiency for random vectors,
which means that differences between a data vector and the query appear in random
locations. In particular there is no need to permute the dimensions as described in
section 2.2. Such schemes aimed at random (or more generally “high entropy”) data
become efficient when there are few vectors within distance $r + \log |S|$ of a query point.
Another variation of the scheme of [Arasu et al. 2006] recently appeared in [Deng et al.
2015].

3. BASIC CONSTRUCTION 

Our basic CoveringLSH construction is a Hamming projection family of the form (1).
We start by observing the following simple property of Hamming projection families:


LEMMA 3.1. For every $A \subseteq \{0,1\}^d$, every $h \in \mathcal{H}_A$, and all $x, y \in \{0,1\}^d$ we have
$h(x) = h(y)$ if and only if $h(x \oplus y) = 0$.


² [Indyk 2000] sketched a way to verify that a random family contains a colliding function for every pair of
vectors within distance r, but unfortunately the construction is incorrect [Indyk 2015].

³ It appears that Theorem 3 of [Greene et al. 1994] does not follow from the calculations of the paper: a
factor of about 4 is missing in the exponent of space and time bounds [Parnas 2015].


ACM Transactions on Algorithms, Vol. 0, No. 0, Article 0, Publication date: 0.
R. Pagh 


1 0 1 0 1 0 1
0 1 1 0 0 1 1
1 1 0 0 1 1 0
0 0 0 1 1 1 1
1 0 1 1 0 1 0
0 1 1 1 1 0 0
1 1 0 1 0 0 1

Fig. 1. The collection $A_7$ corresponding to nonzero vectors of the Hadamard code of message length 3. The
resulting Hamming projection family $\mathcal{H}_{A_7}$, see (1), is 2-covering since for every pair of columns there exists a
row with 0s in these columns. It has weight 4/7 since there are four 1s in each row. Every row covers 3 of the
21 pairs of columns, so no smaller 2-covering family of weight 4/7 exists.


PROOF. Let $a \in A$ be the vector such that $h(x) = x \wedge a$. We have $h(x) = h(y)$ if and
only if $a_i \ne 0 \Rightarrow x_i = y_i$. Since $x_i = y_i \Leftrightarrow (x \oplus y)_i = 0$ the claim follows. □


Thus, to make sure all pairs of vectors within distance r collide for some function, we
need our family to have the property (implicit in the work of [Arasu et al. 2006]) that
every vector with 1s in r bit positions is mapped to zero by some function, i.e., the set of
1s is “covered” by zeros in a vector from A.


Definition 3.2. For $A \subseteq \{0,1\}^d$, the Hamming projection family $\mathcal{H}_A$ is r-covering if
for every $x \in \{0,1\}^d$ with $\|x\| \le r$, there exists $h \in \mathcal{H}_A$ such that $h(x) = 0$. The family is
said to have weight $\omega$ if $\|a\| \ge \omega d$ for every $a \in A$.


A trivial r-covering family uses $A = \{0\}$. We are interested in r-covering families that
have a nonzero weight chosen to make collisions rare among vectors that are not close.
Vectors in our basic r-covering family, which aims at weight around 1/2, will be indexed
by nonzero vectors in $\{0,1\}^{r+1}$. The family depends on a function $m : \{1, \dots, d\} \to \{0,1\}^{r+1}$
that maps bit positions to bit vectors of length r + 1. (We remark that if $d \le 2^{r+1} - 1$ and
m is the function that maps an integer to its binary representation, our construction is
identical to known coverings based on finite geometry [Gordon et al. 1995]; however we
give an elementary presentation that does not require knowledge of finite geometry.)
Define a family of bit vectors $a(v) \in \{0,1\}^d$ by

$$a(v)_i = \begin{cases} 0 & \text{if } \langle m(i), v \rangle \equiv 0 \bmod 2, \\ 1 & \text{otherwise,} \end{cases} \tag{2}$$

where $\langle m(i), v \rangle$ is the dot product of vectors m(i) and v. We will consider the family of
all such vectors with nonzero v:

$$A(m) = \{a(v) \mid v \in \{0,1\}^{r+1} \setminus \{0\}\}.$$

Figure 1 shows the family A(m) for r = 2 and m(i) equal to the binary representation
of i.
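As a sanity check of definition (2), the following sketch builds A(m) for r = 2 and m(i) equal to the binary representation of i, which should reproduce the rows of Figure 1. The helper names are hypothetical; bit vectors of length r + 1 are packed into integers, and rows are listed in the order v = 1, 2, ..., 7.

```python
from itertools import combinations


def dot_mod2(a: int, b: int) -> int:
    """Inner product over F_2 of two bit vectors packed into integers."""
    return bin(a & b).count("1") % 2


def covering_rows(m, r):
    """Rows a(v) of definition (2), one per nonzero v in {0,1}^(r+1), as lists of bits."""
    return [[dot_mod2(m_i, v) for m_i in m] for v in range(1, 2 ** (r + 1))]


def covers_all_subsets(rows, size):
    """True if every set of `size` columns is all-zero in at least one row."""
    d = len(rows[0])
    return all(
        any(all(row[i] == 0 for i in cols) for row in rows)
        for cols in combinations(range(d), size)
    )


if __name__ == "__main__":
    r = 2
    m = list(range(1, 8))  # m(i) = binary representation of i, for i = 1..7
    rows = covering_rows(m, r)
    for row in rows:
        print(*row)
    print("2-covering:", covers_all_subsets(rows, 2))
```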


LEMMA 3.3. For every $m : \{1, \dots, d\} \to \{0,1\}^{r+1}$, the Hamming projection family
$\mathcal{H}_{A(m)}$ is r-covering.

PROOF. Let $x \in \{0,1\}^d$ satisfy $\|x\| \le r$ and consider $a(v) \in A(m)$ as defined in (2). It is
clear that whenever $i \in \{1, \dots, d\} \setminus I(x)$ we have $(a(v) \wedge x)_i = 0$ (recall that $I(x) = \{i \mid x_i = 1\}$).
To consider $(a(v) \wedge x)_i$ for $i \in I(x)$ let $M_x = m(I(x))$, where elements are interpreted
as (r + 1)-dimensional vectors over the field $\mathbb{F}_2$. The span of $M_x$ has dimension at most
$|M_x| \le \|x\| \le r$, and since the space is (r + 1)-dimensional there exists a vector $v_x \ne 0$ that is
orthogonal to $\mathrm{span}(M_x)$. In particular $\langle v_x, m(i) \rangle \bmod 2 = 0$ for all $i \in I(x)$. In turn, this
means that $a(v_x) \wedge x = 0$, as desired. □

If the values of the function m are “balanced” over nonzero vectors the family $\mathcal{H}_{A(m)}$
has weight close to 1/2 for $d \gg 2^r$. More precisely we have:

LEMMA 3.4. Suppose $|m^{-1}(v)| \ge \lfloor d/2^{r+1} \rfloor$ for each $v \in \{0,1\}^{r+1}$ and $m^{-1}(0) = \emptyset$. Then
$\mathcal{H}_{A(m)}$ has weight at least $2^r \lfloor d/2^{r+1} \rfloor / d > (1 - 2^{r+1}/d)/2$.

PROOF. It must be shown that $\|a(v)\| \ge 2^r \lfloor d/2^{r+1} \rfloor$ for each nonzero vector v. Note
that v has a dot product of 1 with a set $V \subseteq \{0,1\}^{r+1}$ of exactly $2^r$ vectors (namely the
nontrivial coset of v’s orthogonal complement). For each $v' \in V$ we have $a(v)_i = 1$ for
all $i \in m^{-1}(v')$. Thus the number of 1s in a(v) is:

$$\sum_{v' \in V} |m^{-1}(v')| \ge 2^r \lfloor d/2^{r+1} \rfloor > (1 - 2^{r+1}/d)\, d/2 .$$

□
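Lemmas 3.3 and 3.4 can be checked exhaustively for small parameters. The sketch below is an illustration only, with hypothetical helper names: masks are packed into integers with 0-based positions, `is_r_covering` enumerates every vector of Hamming weight at most r, and the balanced map used for Lemma 3.4 simply repeats the nonzero vectors of $\{0,1\}^{r+1}$.

```python
import random
from itertools import combinations


def dot_mod2(a: int, b: int) -> int:
    """Inner product over F_2 of two bit vectors packed into integers."""
    return bin(a & b).count("1") % 2


def masks_as_ints(m, r):
    """The masks a(v) of definition (2) for v = 1 .. 2^(r+1) - 1, packed into integers."""
    masks = []
    for v in range(1, 2 ** (r + 1)):
        a = 0
        for i, m_i in enumerate(m):
            a |= dot_mod2(m_i, v) << i
        masks.append(a)
    return masks


def is_r_covering(masks, d, r):
    """Check Definition 3.2 by brute force over all x with at most r ones."""
    for weight in range(r + 1):
        for support in combinations(range(d), weight):
            x = sum(1 << i for i in support)
            if not any(x & a == 0 for a in masks):
                return False
    return True


if __name__ == "__main__":
    rng = random.Random(0)
    d, r = 12, 3
    m = [rng.randrange(2 ** (r + 1)) for _ in range(d)]  # arbitrary m, as in Lemma 3.3
    print("3-covering:", is_r_covering(masks_as_ints(m, r), d, r))

    balanced = [1, 2, 3, 4, 5, 6, 7] * 2  # d = 14, r = 2, m never maps to 0
    print([bin(a).count("1") for a in masks_as_ints(balanced, 2)])
```

The brute-force check is exponential in r and is meant only to confirm the lemmas on toy instances.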


Comment on optimality. We note that the size $|A(m)| = 2^{r+1} - 1$ is close to the smallest
possible for an r-covering family with weight around 1/2. To see this, observe that $\binom{d}{r}$
possible sets of errors need to be covered, and each hash function can cover at most
$\binom{d/2}{r}$ such sets. This means that the number of hash functions needed is at least

$$\binom{d}{r} \Big/ \binom{d/2}{r} \ge 2^r ,$$

which is within a factor of 2 from the upper bound. △


Lemmas 3.3 and 3.4 leave open the choice of mapping m. We will analyze the setting
where m maps to values chosen uniformly and independently from $\{0,1\}^{r+1}$. In this
setting the condition of Lemma 3.4 will in general not be satisfied, but it turns out
that it suffices for m to have balance in an expected sense. We can relate collision
probabilities to Hamming distances as follows:

THEOREM 3.5. For all $x, y \in \{0,1\}^d$ and for random $m : \{1, \dots, d\} \to \{0,1\}^{r+1}$,

(1) If $\|x - y\| \le r$ then $\Pr[\exists h \in \mathcal{H}_{A(m)} : h(x) = h(y)] = 1$.

(2) $E[|\{h \in \mathcal{H}_{A(m)} \mid h(x) = h(y)\}|] < 2^{r+1-\|x-y\|}$.

PROOF. Let $z = x \oplus y$. For the first part we have $\|x - y\| = \|z\| \le r$. Lemma 3.3 states
that there exists $h \in \mathcal{H}_{A(m)}$ such that $h(z) = 0$. By Lemma 3.1 this implies $h(x) = h(y)$.

To show the second part we fix $v \in \{0,1\}^{r+1} \setminus \{0\}$. Now consider $a(v) \in A(m)$, defined
in (2), and the corresponding function $h(x) = x \wedge a(v) \in \mathcal{H}_{A(m)}$. For $i \in I(z)$ we have
$h(z)_i = 0$ if and only if $a(v)_i = 0$. Since m is random and $v \ne 0$ the $a(v)_i$ values are
independent and random, so the probability that $a(v)_i = 0$ for all $i \in I(z)$ is $2^{-\|z\|} = 2^{-\|x-y\|}$.
By linearity of expectation, summing over $2^{r+1} - 1$ choices of v the claim follows. □
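The second part of Theorem 3.5 can be confirmed exactly for tiny parameters by enumerating every possible value of m on the differing positions; positions outside I(z) do not matter since h(z) is zero there regardless of m. The function name `expected_collisions` is hypothetical, and the enumeration is only feasible for small r and small distance t.

```python
from fractions import Fraction
from itertools import product


def dot_mod2(a: int, b: int) -> int:
    """Inner product over F_2 of two bit vectors packed into integers."""
    return bin(a & b).count("1") % 2


def expected_collisions(r: int, t: int) -> Fraction:
    """Exact expected number of nonzero v with a(v)_i = 0 on all t differing positions.

    The expectation is over m chosen uniformly from {0,1}^(r+1) at each of the
    t positions in I(z), computed by full enumeration.
    """
    size = 2 ** (r + 1)
    total = 0
    for m_values in product(range(size), repeat=t):
        total += sum(
            all(dot_mod2(m_i, v) == 0 for m_i in m_values)
            for v in range(1, size)
        )
    return Fraction(total, size ** t)


if __name__ == "__main__":
    r = 2
    for t in range(5):
        print(t, expected_collisions(r, t), "bound:", Fraction(2 ** (r + 1), 2 ** t))
```

In each case the enumerated value equals $(2^{r+1} - 1)\, 2^{-t}$, which is strictly below the bound $2^{r+1-t}$.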


Comments. A few remarks on Theorem 3.5 (that can be skipped if the reader wishes
to proceed to the algorithmic results):


- The vectors in A(m) can be seen as samples from a Hadamard code consisting of
  $2^{r+1}$ vectors of dimension $2^{r+1}$, where bit i of vector j is defined by $\langle i, j \rangle \bmod 2$, again
  interpreting the integers i and j as vectors in $\mathbb{F}_2^{r+1}$. Nonzero Hadamard codewords
  have Hamming weight and minimum distance $2^r$. However, it does not seem that
  error-correcting ability in general yields nontrivial r-covering families.

- The construction can be improved by changing m to map to $\{0,1\}^{r+1} \setminus \{0\}$ and/or
  requiring the function values of m to be balanced such that the number of bit
  positions mapping to each vector in $\{0,1\}^{r+1}$ is roughly the same. This gives an
  improvement when $d \approx 2^r$ but is not significant when d is much smaller or much
  larger than $2^r$. To keep the exposition simple we do not analyze this variant.

- At first glance it appears that the ability to avoid collision for CoveringLSH (“filtering”)
  is not significant when $\|x - y\| = r + 1$. However, we observe that for similarity
  search in Hamming space it can be assumed without loss of generality that either
  all distances from the query point are even or all distances are odd. This can be
  achieved by splitting the data set into two parts, having even and odd Hamming
  weight, respectively, and handling them separately. For a given query y and radius r
  we then perform a search in each part, one with radius r and one with radius r - 1
  (in the part of data where distance r to y is not possible). This reduces the expected
  number of collisions at distance r + 1 to at most 1/2. △


Nearest neighbor. Above we have assumed that the search radius r was given in
advance, but it turns out that CoveringLSH also supports finding the nearest
neighbor, under the condition that the distance is at most r. To see this, consider the
subfamily of A(m) indexed by vectors of the form $0^{r - r_1} v_1$ where $v_1 \in \{0,1\}^{r_1+1} \setminus \{0\}$ for
some $r_1 \le r$; then collision is guaranteed up to distance $r_1$. That is, we can search for a
nearest neighbor at an unknown distance in a natural way, by letting m map randomly
to $\{0,1\}^{\lceil \log n \rceil}$ and choosing v as the binary representation of 1, 2, 3, ... (or alternatively,
the vectors in a Gray code for $\{0,1\}^{\lceil \log n \rceil}$). In either case Theorem 3.5 implies the
invariant that the nearest neighbor has distance at least $\lfloor \log v \rfloor$, where v is interpreted
as an integer. This means that when a point x at distance at most $c \lfloor \log(v + 1) \rfloor$ is found,
we can stop after finishing iteration v and return x as a c-approximate nearest neighbor.
Figure 2 gives pseudocode for data structure construction and nearest neighbor queries
using CoveringLSH.⁴


3.1. Approximation factor c = log(n)/r 


We first consider a case in which the method above directly gives a strong result, namely
when the threshold cr for being an approximate near neighbor equals log n. Such a
threshold may be appropriate for high-entropy data sets of dimension $d > 2 \log n$ where
most distances tend to be large (see [Kucherov et al. 2005; Norouzi et al. 2012] for
discussion of such settings). In this case Theorem 3.5 implies efficient c-approximate
near neighbor search in expected time $O(\Delta_S 2^r) = O(\Delta_S n^{1/c})$, where $\Delta_S$ bounds the
time to compute the Hamming distance between query vector y and a vector $x \in S$. This
matches the asymptotic time complexity of [Indyk and Motwani 1998].

To show this bound observe that the expected total number of collisions $h(x) = h(y)$,
summed over all $h \in \mathcal{H}_{A(m)}$ and $x \in S$ with $\|x - y\| \ge \log n$, is at most $2^{r+1}$. This means
that computing h(y) for each $h \in \mathcal{H}_{A(m)}$ and computing the distance to the vectors that
are not within distance cr but collide with y under some $h \in \mathcal{H}_{A(m)}$ can be done in
expected time $O(\Delta_S 2^r)$. The expected bound can be supplemented by a high probability
bound as follows: Restart the search in a new data structure if the expected time is
exceeded by a factor of 2. Use $O(\log n)$ data structures and resort to brute force if this
fails, which happens with polynomially small probability in n.

What we have bounded is in fact performance on a worst case data set in which 
most data points are just above the threshold for being a c-approximate near neighbor. 


⁴ A corresponding Python implementation is available on github, https://github.com/rasmus-pagh/coveringLSH.

procedure INITIALIZECOVERING(d, r)
    for v ∈ {0,1}^(r+1) do A[v] := 0^d
    for i := 1 to d do
        m := RANDOM({0,1}^(r+1) \ {0})
        for v ∈ {0,1}^(r+1) do A[v]_i := ⟨m, v⟩ mod 2
    end for
end

function BUILDDATASTRUCTURE(S, r)
    D := ∅
    for x ∈ S, v ∈ {0,1}^(r+1) \ {0} do
        D[x ∧ A[v]] := D[x ∧ A[v]] ∪ {x}
    return D
end

function NEARESTNEIGHBOR(D, r, y)
    best := ∞
    nn := null
    for v := 1 to 2^(r+1) - 1 do
        for x ∈ D[y ∧ A[BITVEC(v, r + 1)]] do
            if ||x - y|| < best then
                best := ||x - y||
                nn := x
            end if
        end for
        if best ≤ ⌊log(v + 1)⌋ then return nn
    end for
    return null
end

Fig. 2. Pseudocode for constructing (left) and querying (right) a nearest neighbor data structure on a set
$S \subseteq \{0,1\}^d$ as described in section 3.1. Parameter r controls the largest radius for which a nearest neighbor
is returned. This is the simplest instantiation of CoveringLSH; it works well on high-entropy data where
there are few points within distance $r + \log_2 |S|$ of a query point. In this setting, given a query point y, the
expected search time for finding a nearest neighbor x is $O(2^{\|x - y\|})$. If only a c-approximate nearest neighbor is
sought the condition best ≤ ⌊log(v + 1)⌋ should be changed to best ≤ c⌊log(v + 1)⌋.


Notation: The function RANDOM returns a random element from a given set. The inner product ⟨m, v⟩ can
be computed by a bitwise conjunction followed by counting the number of bits set (POPCNT). D[i] is used to
denote the information associated with key i in the dictionary D that is the main part of the data structure;
if i is not a key in D then D[i] = ∅. The function call BITVEC(v, r + 1) typecasts an integer to a bit vector of
dimension r + 1. Finally, ||x - y|| denotes the Hamming distance between x and y.

Other comments: Vectors are stored $2^{r+1} - 1$ times in D, but may be represented as references to a single
occurrence in memory to achieve better space complexity for large d. The global dictionary A, which
contains a covering independent of the set S, must be initialized by INITIALIZECOVERING before
BUILDDATASTRUCTURE is called. Note that the function m is not stored, as it is not needed after constructing the
covering.
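The pseudocode of Figure 2 can be rendered in Python roughly as follows. This is a minimal sketch written for illustration and is not the implementation referenced in the footnote: vectors are packed into integers with 0-based positions, the covering A is passed explicitly instead of being global, logarithms are base 2, and the stopping rule uses "at most" as in the text on nearest neighbor search.

```python
import random


def popcount(x: int) -> int:
    return bin(x).count("1")


def initialize_covering(d: int, r: int, rng: random.Random) -> list[int]:
    """A[v] is the bit mask a(v); A[0] is the all-zero mask and is never queried."""
    size = 2 ** (r + 1)
    A = [0] * size
    for i in range(d):
        m = rng.randrange(1, size)  # random nonzero vector in {0,1}^(r+1)
        for v in range(size):
            A[v] |= (popcount(m & v) % 2) << i
    return A


def build_data_structure(S, A) -> dict[int, set[int]]:
    """Dictionary from masked values x AND A[v] to the set of points stored there."""
    D: dict[int, set[int]] = {}
    for x in S:
        for v in range(1, len(A)):
            D.setdefault(x & A[v], set()).add(x)
    return D


def nearest_neighbor(D, A, y: int, c: int = 1):
    """Nearest neighbor of y if one exists within the covering radius; else possibly None."""
    best, nn = float("inf"), None
    for v in range(1, len(A)):
        for x in D.get(y & A[v], ()):
            dist = popcount(x ^ y)
            if dist < best:
                best, nn = dist, x
        # floor(log2(v + 1)) computed exactly with integer arithmetic
        if best <= c * ((v + 1).bit_length() - 1):
            return nn
    return None


if __name__ == "__main__":
    rng = random.Random(1)
    A = initialize_covering(d=16, r=4, rng=rng)
    S = [rng.getrandbits(16) for _ in range(50)]
    D = build_data_structure(S, A)
    y = S[0] ^ 0b101  # a query at distance 2 from a data point
    print(nearest_neighbor(D, A, y))
```

As in the pseudocode, buckets are keyed by the masked value alone, so a lookup may also return points stored under a different mask; the explicit distance computation makes this harmless for correctness.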


In general the amount of time needed for a search will depend on the distribution of 
distances between y and data points, and may be significantly lower. 

The space required is $O(2^r n) = O(n^{1+1/c})$ words plus the space required to store the
vectors in S, again matching the bound of Indyk and Motwani. In a straightforward
implementation we need additional space O(d) to store the function m, but if d is large
(for sets of sparse vectors) we may reduce this by only storing m(i) if there exists $x \in S$
with $x_i \ne 0$. With this modification, storing m does not change the asymptotic space
usage. For dense vectors it may be more desirable to explicitly store the set of covering
vectors A(m) rather than the function m, and indeed this is the approach taken in the
pseudocode.


Example. Suppose we have a set S of n = 2 30 vectors from {0, l} 128 and wish to search 
for a vector at distance at most r = 10 from a query vector y. A brute-force search within 
radius r would take much more time than linear search, so we settle for 3-approximate 
similarity search. Vectors at distance larger than 3r have collision probability at most 
1/(2 n) under each of the 2 r+1 - 1 functions in h e 'HA( m ), so in expectation there will be 
less than 2 r = 1024 hash collisions between y and vectors in S. The time to answer a 
query is bounded by the time to compute 2047 hash values for y and inspect the hash 
collisions. 

It is instructive to compare to the family $\mathcal{H}_{A(R)}$ of Indyk and Motwani, described in 
Section 2.2, with the same performance parameters (2047 hash evaluations, collision 
probability $1/(2n)$ at distance 31). A simple computation shows that for k = 78 samples 
we get the desired collision probability, and collision probability $(1 - r/128)^{78} \approx 0.0018$ at 
distance r = 10. This means that the probability of a false negative, i.e., of not producing 
a hash collision for a point at distance r, is $(1 - (1 - r/128)^{78})^{2047} > 0.027$. So the risk 
of a false negative is nontrivial given the same time and space requirements as our 
“covering” LSH scheme. 
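
The arithmetic in this comparison can be reproduced with a few lines of Python. This is only a sketch of the calculation described in the example; the variable names are chosen here for illustration and do not come from the paper.

```python
# Parameters of the example: n = 2^30 vectors in {0,1}^128, radius r = 10, c = 3.
n = 2 ** 30
d = 128
r = 10
c = 3
num_hashes = 2 ** (r + 1) - 1        # 2047 hash functions, as in the example

# Classical bit sampling: find the smallest number k of sampled positions such
# that a pair at distance c*r + 1 = 31 collides with probability <= 1/(2n).
k = 1
while (1 - (c * r + 1) / d) ** k > 1 / (2 * n):
    k += 1

# Collision probability of a single sampled hash function at distance r.
p_near = (1 - r / d) ** k

# Probability that none of the 2047 independent functions produces a collision,
# i.e. the false negative probability for a point at distance exactly r.
p_false_negative = (1 - p_near) ** num_hashes

print(k, p_near, p_false_negative)
```

The loop recovers the sample count k = 78 quoted above, and the two printed probabilities correspond to the figures 0.0018 and 0.027 in the text.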

4. CONSTRUCTION FOR LARGE DISTANCES 

Our basic construction is only efficient when cr has the “right” size (not too small, not 
too large). We now generalize the construction to arbitrary values of r, cr, and n, with a 
focus on efficiency for large distances. In a nutshell: 

— For an arbitrary choice of cr (even much larger than log n) we can achieve performance 
that differs from classical LSH by a factor of ln(4) < 1.4 in the exponent. 

— We can match the exponent of classical LSH for the c-approximate near neighbor 
problem whenever $\lceil \log n \rceil/(cr)$ is (close to) integer. 

We still use a Hamming projection family (1), changing only the set A of bit masks used. 
Our data structure will depend on parameters c and r, i.e., these cannot be specified as 
part of a query. Without loss of generality we assume that cr is integer. 


Intuition. When cr < log n we need to increase the average number of 1s in the bit 
masks to reduce collision probabilities. The increase should happen in a correlated 
fashion in order to maintain the guarantee of collision at distance r. The main idea is 
to increase the fraction of 1s from 1/2 to $1 - 2^{-t}$, for $t \in \mathbb{N}$, by essentially repeating the 
sampling from the Hadamard code t times and selecting those positions where at least 
one sample hits a 1. 

On the other hand, when cr > log n we need to decrease the average number of 1s 
in the bit masks to increase collision probabilities. This is done using a refinement of 
the partitioning method of [Arasu et al. 2006] which distributes the dimensions across 
partitions in a balanced way. The reason this step does not introduce false negatives 
is that for each data point x there will always exist a partition in which the distance 
between query y and x is at most the average across partitions. An example is shown in 
Figure 3. 


We use $b, q \in \mathbb{N}$ to denote, respectively, the number of partitions and the number of 
partitions to which each dimension belongs. Observe that if we distribute q copies of r 
“mismatching” dimensions across b partitions, there will always exist a partition with at 
most $r' = \lfloor rq/b \rfloor$ mismatches. Let Intervals(b, q) denote the set of intervals in $\{1,\dots,b\}$ 
of length q, where intervals are considered modulo b (i.e., with wraparound). We will 
use two random functions, 

$$ m : \{1,\dots,d\} \to (\{0,1\}^{tr'+1})^t $$

$$ s : \{1,\dots,d\} \to \mathrm{Intervals}(b, q) $$

to define a family of bit vectors $a(v,k) \in \{0,1\}^d$, indexed by vectors $v \in \{0,1\}^{tr'+1}$ and 
$k \in \{1,\dots,b\}$, by 

$$ a(v,k)_i = s^{-1}(k)_i \wedge \Bigl( \bigvee_j \langle m(i)_j, v \rangle \bmod 2 \neq 0 \Bigr) , \qquad (3) $$

Fig. 3. The collection $A_{2\times 7}$ containing two copies of the collection $A_7$ from Figure 1, one for each half of the 
dimensions. The resulting Hamming projection family $\mathcal{H}_{A_{2\times 7}}$, see (1), is 5-covering since for every set of 5 columns 
there exists a row with 0s in these columns. It has weight 4/14 since there are four 1s in each row. Every 
row covers $\binom{10}{5}$ sets of 5 columns, so a lower bound on the size of a 5-covering collection of weight 4/14 is 
$\lceil \binom{14}{5} / \binom{10}{5} \rceil = 8$. 


where $s^{-1}(k)$ is the preimage of k under s represented as a vector in $\{0,1\}^d$ (that is, 
$s^{-1}(k)_i = 1$ if and only if $k \in s(i)$), and $\langle m(i)_j, v \rangle$ is the dot product of vectors $m(i)_j$ and v. 
We will consider the family of all such vectors with nonzero v: 

$$ A(m,s) = \{ a(v,k) \mid v \in \{0,1\}^{tr'+1} \setminus \{0\},\ k \in \{1,\dots,b\} \} . $$

Note that the size of $A(m,s)$ is $b(2^{tr'+1} - 1) < 2b\,2^{trq/b}$. 

LEMMA 4.1. For every choice of $b, d, q, t \in \mathbb{N}$, and every choice of functions m and s as 
defined above, the Hamming projection family $\mathcal{H}_{A(m,s)}$ is r-covering. 

PROOF. Let $x \in \{0,1\}^d$ satisfy $\|x\| \le r$. We must argue that there exists a vector 
$v^* \in \{0,1\}^{tr'+1} \setminus \{0\}$ and $k^* \in \{1,\dots,b\}$ such that $a(v^*,k^*) \wedge x = 0$, i.e., by (3), 

$$ \forall i : \; x_i \wedge s^{-1}(k^*)_i \wedge \Bigl( \bigvee_j \langle m(i)_j, v^* \rangle \bmod 2 \neq 0 \Bigr) = 0 . $$

We let $k^* = \arg\min_k \|x \wedge s^{-1}(k)\|$, breaking ties arbitrarily. Informally, $k^*$ is the partition 
with the smallest number of 1s in x. Note that $\sum_{k=1}^{b} \|x \wedge s^{-1}(k)\| = q\,\|x\| \le qr$, since every 
position belongs to exactly q partitions, so by the pigeonhole 
principle, $\|x \wedge s^{-1}(k^*)\| \le \lfloor rq/b \rfloor = r'$. Now consider the “problematic” set $I(x \wedge s^{-1}(k^*))$ of 
positions of 1s in $x \wedge s^{-1}(k^*)$, and the set of vectors that m associates with it: 

$$ M_x = \{ m(i)_j \mid i \in I(x \wedge s^{-1}(k^*)),\ j \in \{1,\dots,t\} \} . $$

The span of $M_x$ has dimension at most $|M_x| \le tr'$. This means that there must exist 
$v^* \in \{0,1\}^{tr'+1} \setminus \{0\}$ that is orthogonal to all vectors in $M_x$. In particular this implies that 
for each $i \in I(x \wedge s^{-1}(k^*))$ the condition $\bigvee_j \langle m(i)_j, v^* \rangle \bmod 2 \neq 0$ is false, as desired. □ 
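
The construction of A(m, s) and the covering property of Lemma 4.1 can be checked by brute force on a toy instance. The following Python sketch is illustrative only: the function names `build_masks` and `is_r_covering`, the 0-based numbering of partitions and positions, and the parameter values are assumptions made here and are not taken from the paper's pseudocode. It requires q ≤ b.

```python
import itertools
import random

def build_masks(d, r, b, q, t, rng):
    """Return all bit masks a(v, k), v nonzero, as tuples of 0/1 of length d."""
    r_prime = (r * q) // b              # r' = floor(rq/b)
    dim = t * r_prime + 1               # length of v and of each vector m(i)_j
    # m(i): t random bit vectors of length dim for every position i.
    m = [[tuple(rng.randrange(2) for _ in range(dim)) for _ in range(t)]
         for _ in range(d)]
    # s(i): a random cyclic interval of length q in {0, ..., b-1}.
    s = []
    for _ in range(d):
        start = rng.randrange(b)
        s.append({(start + off) % b for off in range(q)})
    masks = []
    for v in itertools.product((0, 1), repeat=dim):
        if not any(v):
            continue                    # only nonzero v are used
        for k in range(b):
            masks.append(tuple(
                1 if (k in s[i] and any(
                    sum(mb * vb for mb, vb in zip(m[i][j], v)) % 2
                    for j in range(t))) else 0
                for i in range(d)))
    return masks

def is_r_covering(masks, d, r):
    """Check that every set of at most r positions is zeroed by some mask."""
    for weight in range(r + 1):
        for ones in itertools.combinations(range(d), weight):
            if not any(all(a[i] == 0 for i in ones) for a in masks):
                return False
    return True

masks = build_masks(d=10, r=3, b=3, q=2, t=2, rng=random.Random(0))
print(len(masks), is_r_covering(masks, d=10, r=3))
```

Because the lemma holds for every choice of m and s, the check should succeed for any seed; the randomness only affects how many collisions occur at larger distances.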

We are now ready to show the following extension of Theorem 3.5. 

THEOREM 4.2. For random m and s, for every $b, d, q, r, t \in \mathbb{N}$ and $x, y \in \{0,1\}^d$: 

(1) $\|x - y\| \le r \Rightarrow \Pr[\exists h \in \mathcal{H}_{A(m,s)} : h(x) = h(y)] = 1$. 

(2) $\mathrm{E}[\,|\{h \in \mathcal{H}_{A(m,s)} \mid h(x) = h(y)\}|\,] < (1 - (1 - 2^{-t}) q/b)^{\|x-y\|}\, b\, 2^{trq/b+1}$. 

PROOF. By Lemma 3.1 we have h(x) = h(y) if and only if h(z) = 0 where $z = x \oplus y$. 
So the first part of the theorem is a consequence of Lemma 4.1. For the second part 
consider a particular vector a(v, k), where v is nonzero, and the corresponding hash 
value $h(z) = z \wedge a(v,k)$. We argue that over the random choice of m and s we have, for 
each i: 


$$ \Pr[a(v,k)_i = 0] = \Pr[s^{-1}(k)_i = 0] + \Pr[s^{-1}(k)_i = 1 \wedge \forall j : \langle m(i)_j, v \rangle \equiv 0 \bmod 2] $$
$$ = (1 - q/b) + 2^{-t} q/b \qquad (4) $$
$$ = 1 - (1 - 2^{-t})\, q/b . $$

The second equality uses independence of the vectors $\{m(i)_j \mid j = 1,\dots,t\}$ and s(i), 
and that for each j we have $\Pr[\langle m(i)_j, v \rangle \equiv 0 \bmod 2] = 1/2$. Observe also that $a(v,k)_i$ 
depends only on s(i) and m(i). Since function values of s and m are independent, so are 
the values 

$$ \{ a(v,k)_i \mid i \in \{1,\dots,d\} \} . $$

This means that the probability of having $a(v,k)_i = 0$ for all i where $z_i = 1$ is a product 
of probabilities from (4): 

$$ \Pr[h(x) = h(y)] = \prod_{i \in I(z)} (1 - (1 - 2^{-t}) q/b) = (1 - (1 - 2^{-t}) q/b)^{\|x-y\|} . $$

The second part of the theorem follows by linearity of expectation, summing over the 
vectors in A(m, s). □ 
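
The per-position probability (4) can be confirmed exactly for small parameters by enumerating every possible value of m(i) and s(i). The sketch below is an illustration under assumptions made here: the helper names `exact_prob_zero` and `formula` are hypothetical, partitions are numbered from 0, and s(i) is taken to be a uniformly random cyclic interval of length q ≤ b.

```python
from fractions import Fraction
import itertools

def exact_prob_zero(dim, t, b, q, v, k):
    """Exact Pr[a(v,k)_i = 0] over uniform m(i) and s(i), by enumeration."""
    vectors = list(itertools.product((0, 1), repeat=dim))
    zero = total = 0
    for start in range(b):                        # all cyclic intervals s(i)
        interval = {(start + off) % b for off in range(q)}
        for m_i in itertools.product(vectors, repeat=t):   # all values of m(i)
            # hit is True when some <m(i)_j, v> is nonzero modulo 2
            hit = any(sum(x * y for x, y in zip(mj, v)) % 2 for mj in m_i)
            zero += not (k in interval and hit)   # a(v,k)_i = 0 in this case
            total += 1
    return Fraction(zero, total)

def formula(t, b, q):
    """Right-hand side of (4): 1 - (1 - 2^-t) q/b."""
    return 1 - (1 - Fraction(1, 2 ** t)) * Fraction(q, b)

print(exact_prob_zero(2, 2, 3, 2, (1, 0), 1), formula(2, 3, 2))
```

Using exact fractions avoids any floating-point tolerance in the comparison; the enumeration is only feasible for very small dim and t.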


4.1. Choice of parameters 

The expected time complexity of c-approximate near neighbor search with radius r 
is bounded by the size $|A|$ of the hash family plus the expected number $\kappa_A$ of hash 
collisions between the query y and vectors in S that are not c-approximate near neighbors. 
Define 

$$ S_{\mathrm{far}} = \{x \in S \mid \|x - y\| > cr\} \quad\text{and}\quad \kappa_A = \mathrm{E}[\,|\{(x,h) \in S_{\mathrm{far}} \times \mathcal{H}_A \mid h(x) = h(y)\}|\,] $$

where the expectation is over the choice of family A. Choosing parameters t, b, and q in 
Theorem 4.2 in order to get a family A that minimizes $|A| + \kappa_A$ is nontrivial. Ideally we 
would like to balance the two costs, but integrality of the parameters means that there 
are “jumps” in the possible sizes and filtering efficiencies of $\mathcal{H}_{A(m,s)}$. Figure 4 shows 
bounds achieved by numerically selecting the best parameters in different settings. We 
give a theoretical analysis of some choices of interest below. In the most general case 
the strategy is to reduce to a set of subproblems that hit the “sweet spot” of the method, 
i.e., where $|A|$ and $\kappa_A$ can be made equal. 


COROLLARY 4.3. For every c > 1 there exist explicit, randomized r-covering Hamming 
projection families $\mathcal{H}_{A_1}$, $\mathcal{H}_{A_2}$ such that for every $y \in \{0,1\}^d$: 

(1) $|A_1| < 2^{r+1} n^{1/c}$ and $\kappa_{A_1} < 2^{r+1} n^{1/c}$. 

(2) If $\log(n)/(cr) + \varepsilon \in \mathbb{N}$, for $\varepsilon > 0$, then $|A_1| < 2^{\varepsilon r+1} n^{1/c}$ and $\kappa_{A_1} < 2^{\varepsilon r+1} n^{1/c}$. 

(3) If $r > \lfloor \ln(n)/c \rfloor$ then $|A_2| < 8r\,n^{\ln(4)/c}$ and $\kappa_{A_2} < 8r\,n^{\ln(4)/c}$. 

PROOF. We let $A_1 = A(m,s)$ with b = q = 1 and $t = \lceil \log(n)/(cr) \rceil$. Then 

$$ |A_1| < 2b\,2^{trq/b} = b\,2^{tr+1} < 2^{(\log(n)/(cr)+1)r+1} = 2^{r+1} n^{1/c} . $$

Summing over $x \in S_{\mathrm{far}}$ the second part of Theorem 4.2 yields: 

$$ \kappa_{A_1} < n\,2^{-tcr}\,2^{tr+1} \le 2^{tr+1} < 2^{r+1} n^{1/c} . $$

[Figure 4: two plots of Time against n (size of set). 
First panel, “Similarity search, radius 16, approx. factor 2”: Linear search; Exhaustive search in Hamming ball; Classical LSH, error prob. 1/n; Classical LSH, error prob. 1%; CoveringLSH (1 partition). 
Second panel, “Similarity search, radius 256, approx. factor 2”: Linear search; Classical LSH, error prob. 1/n; Classical LSH, error prob. 1%; CoveringLSH (multi-partition).] 

Fig. 4. Expected number of memory accesses for different similarity search methods for finding a vector 
within Hamming distance r of a query vector y. The plots are for r = 16 and r = 256, respectively, and are 
for a worst-case data set where all points have distance 2r from y, i.e., there exists no c-approximate near 
neighbor for an approximation factor c < 2. The bound for exhaustive search in a Hamming ball of radius r 
optimistically assumes that the number of dimensions is $\log_2 n$, which is smallest possible for a data set of 
size n (for r = 256 this number is so large that it is not even shown). Two bounds are shown for the classical 
LSH method of Indyk and Motwani: A small fixed false negative probability of 1%, and a false negative 
probability of 1/n. The latter is what is needed to ensure no false negatives in a sequence of n searches. The 
bound for CoveringLSH in the case r = 16 uses a single partition (b = 1), while for r = 256 multiple partitions 
are used. 


For the second bound on $|A_1|$ we notice that the factor $2^r$ is caused by the rounding in 
the definition of t, which can cause $2^{tr}$ to jump by a factor $2^r$. When $\log(n)/(cr) + \varepsilon$ is 
integer we instead get a factor $2^{\varepsilon r}$. 

Finally, we let $A_2 = A(m,s)$ with b = r, $q = 2\lceil \ln(n)/c \rceil$, and t = 1. The size of $A_2$ is 
bounded by $b\,2^{trq/b+1} < r\,2^{2\ln(n)/c+3} = 8r\,n^{\ln(4)/c}$. Again, by Theorem 4.2 and summing 
over $x \in S_{\mathrm{far}}$: 

$$ \kappa_{A_2} < n\,(1 - q/(2r))^{cr}\, r\,2^{q+1} < n \exp(-qc/2)\, r\,2^{q+1} \le r\,2^{q+1} < 8r\,n^{\ln(4)/c} , $$

where the second inequality follows from the fact that $1 - \alpha < \exp(-\alpha)$ when $\alpha > 0$. □ 
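
To make the parameter choices in this proof concrete, the following Python sketch evaluates them numerically. It is an illustration under assumptions made here: the function names `params_a1` and `params_a2` are hypothetical, log denotes the base-2 logarithm and ln the natural logarithm, and the returned sizes use the exact count b(2^{tr'+1} - 1) from Section 4 rather than its upper bound.

```python
import math

def params_a1(n, c, r):
    """A_1: b = q = 1 and t = ceil(log2(n) / (c r)), so that r' = r."""
    t = math.ceil(math.log2(n) / (c * r))
    size = 2 ** (t * r + 1) - 1            # |A_1| = 2^(t r + 1) - 1
    bound = 2 ** (r + 1) * n ** (1 / c)    # bound 2^(r+1) n^(1/c) of part (1)
    return t, size, bound

def params_a2(n, c, r):
    """A_2: b = r, q = 2 ceil(ln(n) / c) and t = 1."""
    b, t = r, 1
    q = 2 * math.ceil(math.log(n) / c)
    r_prime = (r * q) // b                 # r' = floor(rq/b), equal to q here
    size = b * (2 ** (t * r_prime + 1) - 1)
    bound = 8 * r * n ** (math.log(4) / c) # bound 8 r n^(ln(4)/c) of part (3)
    return q, size, bound

# The example of Section 3: n = 2^30, c = 3, r = 10.
print(params_a1(2 ** 30, 3, 10))
# A large-radius setting: n = 2^20, c = 2, r = 256.
print(params_a2(2 ** 20, 2, 256))
```

With n = 2^30, c = 3 and r = 10 the first call gives t = 1 and a family of 2047 functions, which agrees with the number of hash evaluations in the example of Section 3.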


5. CONSTRUCTION FOR SMALL DISTANCES 

In this section we present a different generalization of the basic construction of Section 3 
that is more efficient for small distances, $cr \le \log(n)/(3\log\log n)$, than the construction 
of Section 4. The existence of asymptotically good near neighbor data structures for 
small distances is not a big surprise: For $r = o(\log(n)/\log\log n)$ it is known how to 
achieve query time $n^{o(1)}$ [Cole et al. 2004], even with c = 1. In practice this will most 
likely be no faster than linear search for realistic values of n except when r is a small 
constant. In contrast we seek a method that has reasonable constant factors and may 
be useful in practice. 

The idea behind the generalization is to consider vectors and dot products modulo p 
for some prime p > 2. This corresponds to using finite geometry coverings over the 
field of size p [Gordon et al. 1995], but like in Section 3 we make an elementary 
presentation without explicitly referring to finite geometry. Vectors in the r-covering 
family, which aims at weight around 1 - 1/p, will be indexed by nonzero vectors in 
$\{0,\dots,p-1\}^{r+1}$. Generalizing the setting of Section 3, the family depends on a function 
$m : \{1,\dots,d\} \to \{0,\dots,p-1\}^{r+1}$ that maps bit positions to vectors of length r + 1. Define 
a family of bit vectors $a(v) \in \{0,1\}^d$, $v \in \{0,\dots,p-1\}^{r+1}$, by 

$$ a(v)_i = \begin{cases} 0 & \text{if } \langle m(i), v \rangle \equiv 0 \bmod p, \\ 1 & \text{otherwise} \end{cases} \qquad (5) $$

for all $i \in \{1,\dots,d\}$, where $\langle m(i), v \rangle$ is the dot product of vectors m(i) and v. We will 
consider the family of all such vectors with nonzero v: 

$$ A(m) = \{ a(v) \mid v \in \{0,\dots,p-1\}^{r+1} \setminus \{0\} \} . $$

LEMMA 5.1. For every $m : \{1,\dots,d\} \to \{0,\dots,p-1\}^{r+1}$, the Hamming projection 
family $\mathcal{H}_{A(m)}$ is r-covering. 

PROOF. Identical to the proof of Lemma 3.3. The only difference is that we consider 
the field $\mathbb{F}_p$ of size p. □ 

Next, we relate collision probabilities to Hamming distances as follows: 

THEOREM 5.2. For all $x, y \in \{0,1\}^d$ and for random $m : \{1,\dots,d\} \to \{0,\dots,p-1\}^{r+1}$, 

(1) If $\|x - y\| \le r$ then $\Pr[\exists h \in \mathcal{H}_{A(m)} : h(x) = h(y)] = 1$. 

(2) $\mathrm{E}[\,|\{h \in \mathcal{H}_{A(m)} \mid h(x) = h(y)\}|\,] < p^{r+1-\|x-y\|}$. 

PROOF. The proof is completely analogous to that of Theorem 3.5. The first part 
follows from Lemma 5.1. For the second part we use that $\Pr[\langle m(i), v \rangle \equiv 0 \bmod p] = 1/p$ 
for each $v \neq 0$ and that we are summing over $p^{r+1} - 1$ values of v. □ 
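
The mod-p family (5) and the covering property of Lemma 5.1 can be tried out on a toy instance. The Python sketch below is only an illustration: the names `build_masks_mod_p` and `covers_all`, the 0-based indexing, and the parameter values d = 8, r = 2, p = 3 are assumptions made here. It assumes that p is prime, as the lemma requires.

```python
import itertools
import random

def build_masks_mod_p(d, r, p, rng):
    """Return the masks a(v) of (5) for all nonzero v in {0,...,p-1}^(r+1)."""
    # m(i): a random vector in {0, ..., p-1}^(r+1) for every bit position i.
    m = [tuple(rng.randrange(p) for _ in range(r + 1)) for _ in range(d)]
    masks = []
    for v in itertools.product(range(p), repeat=r + 1):
        if not any(v):
            continue                       # skip the zero vector
        masks.append(tuple(
            0 if sum(a * b for a, b in zip(m[i], v)) % p == 0 else 1
            for i in range(d)))
    return masks

def covers_all(masks, d, r):
    """True if every set of at most r positions is zeroed by some mask."""
    return all(
        any(all(a[i] == 0 for i in ones) for a in masks)
        for weight in range(r + 1)
        for ones in itertools.combinations(range(d), weight))

masks = build_masks_mod_p(d=8, r=2, p=3, rng=random.Random(1))
print(len(masks), covers_all(masks, d=8, r=2))
```

The family has p^{r+1} - 1 masks, and each mask has a fraction of roughly 1 - 1/p ones, which is the weight the text aims for.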


Now suppose that $cr \le \log(n)/(3\log\log n)$ and let p be the smallest prime number 
such that $p^{cr} > n$, or in other words the smallest prime $p > n^{1/(cr)}$. We refer to the family 
A(m) with this choice of p as $A_3$, and note that $|A_3| < p^{r+1}$. 

By the second part of Theorem 5.2 the expected total number of collisions h(x) = h(y), 
summed over all $h \in \mathcal{H}_{A_3}$ and $x \in S$ with $\|x - y\| > cr$, is at most $p^{r+1}$. This means that 
computing h(y) for each $h \in \mathcal{H}_{A_3}$ and computing the distance to the vectors that are not 
within distance cr but collide with y under some $h \in \mathcal{H}_{A_3}$ can be done in expected time 

$O(d\,p^{r+1})$. 
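
The choice of p can be illustrated with a short Python sketch. The helper names `is_prime` and `smallest_prime_for` are hypothetical, trial division is used only for simplicity, and the sample values n = 2^30, c = 3, r = 2 (so cr = 6) are picked for easy arithmetic; they lie outside the small-distance regime considered in this section.

```python
def is_prime(k):
    """Trial division; adequate for the small numbers used here."""
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True

def smallest_prime_for(n, cr):
    """Smallest prime p with p^cr > n, using exact integer arithmetic."""
    p = 2
    while not (is_prime(p) and p ** cr > n):
        p += 1
    return p

n, c, r = 2 ** 30, 3, 2
p = smallest_prime_for(n, c * r)
family_size = p ** (r + 1) - 1     # number of nonzero vectors v, i.e. |A_3|
print(p, family_size)
```

Here n^{1/(cr)} = 32, so the search skips 31 (since 31^6 < 2^30) and stops at p = 37.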

What remains is to bound $p^{r+1}$ in terms of n and c. According to results on prime gaps 
(see e.g. [Dudek 2014] and its references) there exists a prime between every pair of 
cubes $a^3$ and $(a+1)^3$ for a larger than an explicit constant. We will use the slightly 
weaker upper bound $(1 + 4/a)a^3 > (a+1)^3$, which holds for a > 4. If n exceeds a certain 
constant, since p is the smallest such prime, choosing $a = n^{1/(3cr)}$ we have $p < (1 + 4/a)a^3$. 
By our upper bound on cr we have $a \ge n^{\log\log n/\log n} = \log n$. Using $r + 1 \le \log(n)$ we have 

$$ |A_3| < p^{r+1} < \bigl((1 + 4/a)\,a^3\bigr)^{r+1} < (1 + 4/\log n)^{\log n}\, n^{(r+1)/(cr)} < e^4\, n^{(r+1)/(cr)} . \qquad (6) $$

Improvement for small r. To asymptotically improve this bound for small r we observe 
that without loss of generality we can assume that $cr \ge \log(n)/(6\log\log n)$: If this is not 
the case, move to vectors of dimension dt by repeating all vectors t times, where t is 
the largest integer with $crt \le \log(n)/(3\log\log n)$. This increases all distances by a factor 
exactly t < log n, and increases d by at most a factor t < log n. Then we have: 


$$ |A_3| < p^{r+1} < e^4\, n^{(r+1)/(cr)} = e^4\, n^{1/c + 1/(cr)} \le e^4\, n^{1/c} (\log n)^6 . \qquad (7) $$

That is, the bound $p^{r+1}$ on the expected time usage matches the asymptotic time complexity of 
[Indyk and Motwani 1998] up to a polylogarithmic factor. 


Comments. In principle, we could combine the construction of this section with 
partitioning to achieve improved results for some parameter choices. However, it appears 
difficult to use this for improved bounds in general, so we have chosen to not go in that 
direction. The constant 3 in the upper bound on cr comes from bounds on the maximum 
gap between primes. A proof of Cramér’s conjecture on the size of prime gaps would 
imply that 3 can be replaced by any constant larger than 1, which in turn would lead to 
a smaller exponent in the polylogarithmic overhead. 


6. PROOF OF THEOREM?? 

The data structure will choose either $A_1$ or $A_2$ of Corollary 4.3, or $A_3$ of Section 5 with 
size bounded in (7), depending on which $i \in \{1,2,3\}$ minimizes $|A_i| + \kappa_{A_i}$. The term $n^{0.4/c}$ 
comes from part (3) of Corollary 4.3 and the inequality ln(4) < 1.4. 

The resulting space usage is $O(|A_i|\, n \log n + nd)$ bits, representing buckets by lists of 
pointers to an array of all vectors in S. Also observe that the expected query time is 
bounded by $|A_i| + \kappa_{A_i}$. □ 


7. CONCLUSION AND OPEN PROBLEMS 

We have seen that, at least in Hamming space, LSH-based similarity search can be 
implemented to avoid the problem of false negatives at little or no cost in efficiency 
compared to conventional LSH-based methods. The methods presented are simple 
enough that they may be practical. An obvious open problem is to completely close the 
gap, or show that a certain loss of efficiency is necessary (the non-constructive bound in 
Section 2.3 shows that the gap is at most a factor $O(d)$). 

It is of interest to investigate the possible time-space trade-offs. CoveringLSH uses 
superlinear space and employs a data independent family of functions. Is it possible to 
achieve covering guarantees in linear or near-linear space? Can data structures with 
very fast queries and polynomial space usage match the performance achievable with 
false negatives [Laarhoven 2015]? 

Another interesting question is what results are possible in this direction for other 
spaces and distance measures, e.g., $\ell_1$, $\ell_2$, or $\ell_\infty$. For example, a more practical 
alternative to the reduction of [Indyk 2007] for handling $\ell_1$ and $\ell_2$ would be interesting. 

Finally, CoveringLSH is data independent. Is it possible to improve performance by 
using data dependent techniques? 